Abstract

Predictive modelling of multivariate data where both the covariates and responses are
high-dimensional is becoming an increasingly popular task in many data mining applications. Partial
Least Squares (PLS) regression often turns out to be a useful model in these situations since it
performs dimensionality reduction by assuming the existence of a small number of latent factors
that may explain the linear dependence between input and output. In practice, the number of latent
factors to be retained, which controls the complexity of the model and its predictive ability, has
to be carefully selected. Typically this is done by cross validating a performance measure, such
as the predictive error. Although cross validation works well in many practical settings, it can be
computationally expensive. Various extensions to PLS have also been proposed for regularising
the PLS solution and performing simultaneous dimensionality reduction and variable selection,
but these come at the expense of additional complexity parameters that also need to be tuned by
cross validation. In this paper we derive a computationally efficient alternative to leave-one-out
cross validation (LOOCV), a predicted sum of squares (PRESS) statistic for two-block PLS. We
show that the PRESS is nearly identical to LOOCV but has the computational expense of only
a single PLS model fit. Examples of the PRESS for selecting the number of latent factors and
regularisation parameters are provided.

1 Introduction 

In this work we consider regression settings characterised by a matrix $X \in \mathbb{R}^{n \times p}$
of p covariates and a matrix $Y \in \mathbb{R}^{n \times q}$ of q responses, both observed on n
objects. Assuming centred data, the standard regression approach consists of fitting a Multivariate
Linear Regression (MLR) model, that is $Y = X\beta + \epsilon$, where
$\beta = (X^T X)^{-1} X^T Y \in \mathbb{R}^{p \times q}$ is a matrix of regression coefficients and
$\epsilon \in \mathbb{R}^{n \times q}$ is the matrix of uncorrelated, mean-zero errors. In many
situations where the dimensionality of the covariates is very high, the problem of multicollinearity
prevents $X^T X$ from being invertible. Furthermore, it can be shown that the columns,
$[\beta_1, \ldots, \beta_q]$, of the matrix of regression coefficients $\beta$ are in fact the
coefficients of the regressions of the individual response variables, $[y_1, \ldots, y_q]$, on X,
respectively (see, for instance, [1]). This implies that the least squares estimate for $\beta$ is
equivalent to performing q separate multiple regressions. Therefore, the MLR solution contains no
information about the correlation between variables in the response.
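
The equivalence between the joint MLR estimate and q separate regressions can be checked
numerically. The following is a minimal sketch using NumPy; the data sizes and the random data are
illustrative assumptions and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 5, 3                      # illustrative sizes (assumed)
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))
X -= X.mean(axis=0)                     # centre the data, as the text assumes
Y -= Y.mean(axis=0)

# Joint MLR estimate: beta = (X^T X)^{-1} X^T Y, shape (p, q).
beta_joint = np.linalg.solve(X.T @ X, X.T @ Y)

# q separate multiple regressions, one per response column.
beta_separate = np.column_stack(
    [np.linalg.solve(X.T @ X, X.T @ Y[:, j]) for j in range(q)]
)

# True when every column of the joint estimate equals its separate regression.
columns_match = np.allclose(beta_joint, beta_separate)
```

Because each column of the estimate depends only on the matching column of Y, the fit carries no
information about the correlation between responses.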

A common solution to the problems introduced by multicollinearity and multivariate responses
involves imposing constraints on $\beta$ which improve prediction performance by effectively
performing dimensionality reduction. Two-block Partial Least Squares (PLS) is a technique which
performs simultaneous dimensionality reduction and regression by identifying two sets of latent
factors underlying the data which explain the most covariance between covariates and response. Here
the two blocks refer to the setting where both the covariates and the response are multivariate.
These latent factors are then used for prediction rather than the full data matrices X and Y. As
such, PLS is used in many situations where the number of variables is very large. Two-block PLS is
described in detail in Section 2.

There are several modern applications for which two-block PLS regression is particularly suitable.
PLS is widely used in the field of chemometrics, where many problems involve predicting the
properties of chemicals based on their composition and the latent factors capture the underlying
chemical processes [2]. In computational finance, PLS has been used to identify the latent factors
which explain the movement of different markets in order to predict stock market returns [3].

Closely related is the field of multi-task learning in which many tasks with univariate output are
learned simultaneously. Using information from multiple tasks improves performance, and so the
tasks can be treated as one large problem with a multivariate response (see, for example, [4]). For
example, there are similarities in the relevance of a particular web page to a given query depending
on geographical location, and so learning the predictive relationship between query terms and web
pages across multiple locations helps improve the relevance of the returned pages [5]. In
applications involving web search data, n can be extremely large, on the order of hundreds of
thousands.

One critical issue that arises when using PLS regression to attack such problems relates to the
selection of the number of latent factors, R, to include in the model. The choice of R is extremely
important: if R is too small, important features in the data may not be captured; if R is too large
the model will overfit. Furthermore, the number of latent factors can be important in interpreting
the results. Recently, several extensions to PLS have been proposed to improve prediction in several
settings. When there are many irrelevant variables which do not contribute to the relationship
between X and Y (i.e. the underlying model of the data is sparse), we can regularize the latent
factors to remove the contribution of noise variables from the regression coefficients. However,
better prediction performance comes at the cost of introducing additional important parameters,
such as the degree of sparsity, which must be tuned. The success of these and the many other
extensions to PLS depends on the careful tuning of these parameters.

For decades, performing K-fold cross validation (CV) on the prediction error has been a popular
tool for model selection in linear models [6, 7]. In PLS regression, model selection has also been
commonly performed using leave-one-out cross validation (LOOCV) [8] and K-fold CV [9]. The
CV procedure involves repeated fitting of the model using subsets of the data in the following way.
The data is split into K equal sized groups and the parameters are estimated using all but the k th
group of observations, which is used for testing. The K groups are each left out in turn and the
K-fold cross validated error is given by

$$E_{CV} = \frac{1}{K} \sum_{k=1}^{K} \| Y_k - X_k \beta(k) \|_2^2,$$

where the subscript k denotes that only the observations in the k th group are used whereas (k)
denotes the estimate of $\beta$ obtained by leaving out the k th group.
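
The K-fold procedure described above can be sketched as follows. This is a minimal illustration
that uses ordinary least squares as the model being validated; the function name `kfold_cv_error`,
the random fold assignment and the demo data are assumptions made for the example.

```python
import numpy as np

def kfold_cv_error(X, Y, K, seed=0):
    """K-fold CV error: (1/K) * sum over folds of the held-out squared error."""
    n = X.shape[0]
    order = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(order, K)          # K groups of (nearly) equal size
    total = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(order, test_idx)
        # beta(k): least-squares estimate obtained without the k-th group.
        beta_k, *_ = np.linalg.lstsq(X[train_idx], Y[train_idx], rcond=None)
        resid = Y[test_idx] - X[test_idx] @ beta_k
        total += np.sum(resid ** 2)
    return total / K

# Illustrative data (assumed): 30 observations, 3 covariates, 2 responses.
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((30, 3))
Y_demo = X_demo @ rng.standard_normal((3, 2)) + 0.1 * rng.standard_normal((30, 2))

cv_5 = kfold_cv_error(X_demo, Y_demo, K=5)      # 5-fold CV
cv_loo = kfold_cv_error(X_demo, Y_demo, K=30)   # K = n gives LOOCV
```

Setting K = n requires n model fits, which is the cost that the PRESS statistic is designed to
avoid.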

The choice of K is important as it has implications for both the accuracy of the model selection
and its computational cost. When K = n we obtain leave-one-out cross validation (LOOCV),
where the parameters are estimated using all but one observation and evaluated on the remaining
observation. LOOCV is a popular choice for model selection as it makes the most efficient use of the
data available for training. LOOCV also has the property that, as the number of samples increases,
it is asymptotically equivalent to the Akaike Information Criterion (AIC), which is commonly used
for model selection in a variety of applications [10]. However, LOOCV can be extremely
computationally expensive if n or p is large.

The use of these techniques is subject to the same problems as in OLS. However, since PLS is
typically used in situations where the data is very high dimensional and the sample size is small,
the problems can be amplified. When n is small compared to p and q, constructing K cross validation
sets is wasteful and the resulting model selection becomes less accurate. Similarly, although LOOCV
selects the true model with better accuracy under these conditions, the computational cost becomes
prohibitive when p, q or n are large since the complexity of PLS is of order
$O(n(p^2 q + q^2 p))$ [11]. Problems of performing model selection in PLS are discussed by [12].

It has long been known that for ordinary least squares, the LOOCV of the prediction error has
an exact analytic form known as the Predicted Sum of Squares (PRESS) statistic [13]. Using this
formulation it is possible to rewrite the LOOCV for each observation as a function of only the
residual error and other terms which do not require explicit partitioning of the data. We briefly
review this result in Section 3.1. However, such a method for computing PRESS efficiently for
two-block PLS has not been developed in the literature. In this work we derive an analytic form of
PRESS for two-block PLS regression based on the same techniques as PRESS for OLS, which we present
in Section 3.2. In Section 3.3 we show that, under mild assumptions, the PRESS is equivalent to
LOOCV up to an approximation error of order $O(\sqrt{\log(n)/n})$. In Section 4 we illustrate how
the PLS PRESS can be used for efficient model selection with an application to Sparse PLS where the
PRESS is used to select the number of latent factors R and the regularization parameter controlling
the degree of sparsity in the regression coefficients. Finally, we report on experiments performed
using data simulated under the sparse PLS model and show that the PRESS performs almost exactly as
LOOCV but at a much smaller computational cost.

2 PLS Regression 

PLS is a method for performing simultaneous dimensionality reduction and regression by identifying 
a few latent factors underlying the data which best explain the covariance between the covariates and 
the response. Regression is then performed using these lower dimensional latent factors rather than 
the original data in order to obtain better prediction performance since only the features in the data 
which are important for prediction are used. 



The two-block PLS model assumes X and Y are generated by a small number, R, of latent factors

$$X = T P^T + E_x, \qquad Y = S Q^T + E_y,$$

where the columns of $T$ and $S \in \mathbb{R}^{n \times R}$ are the R latent factors of X and Y
and $P \in \mathbb{R}^{p \times R}$ and $Q \in \mathbb{R}^{q \times R}$ are the factor loadings.
$E_x \in \mathbb{R}^{n \times p}$ and $E_y \in \mathbb{R}^{n \times q}$ are the matrices of
residuals. The latent factors are linear combinations of all the variables and are found so that

$$\mathrm{Cov}(T, S)^2 = \max_{U,V} \mathrm{Cov}(XU, YV)^2,$$

where $U \in \mathbb{R}^{p \times R}$ and $V \in \mathbb{R}^{q \times R}$ are found by computing
the singular value decomposition (SVD) of $M = X^T Y$ as $M = U G V^T$ where U and V are orthogonal
matrices and G is a diagonal matrix of singular values. There exists an "inner-model" relationship
between the latent factors

$$S = T D + H, \qquad (1)$$

where $D \in \mathbb{R}^{R \times R}$ is a diagonal matrix and H is a matrix of residuals so that
the relationship between X and Y can be written as a regression problem

$$Y = T D Q^T + (H Q^T + E_y) = X\beta + E_y^*,$$

where $\beta = U D Q^T$ and $E_y^* = H Q^T + E_y$. Since U and V are computed using the first R
singular vectors of the SVD of M, the R PLS directions are orthogonal and so the latent factors can
be estimated independently.
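
The fitting steps of this section can be sketched in a few lines of NumPy. This is an illustrative
reading of the model rather than the authors' implementation: the function name
`fit_two_block_pls`, the per-factor least-squares estimates of D and Q, and the noise-free demo
data are assumptions.

```python
import numpy as np

def fit_two_block_pls(X, Y, R):
    """Sketch of two-block PLS with R latent factors (X and Y assumed centred)."""
    M = X.T @ Y                                    # p x q cross-product matrix
    U_full, g, Vt = np.linalg.svd(M, full_matrices=False)
    U, V = U_full[:, :R], Vt[:R].T                 # first R singular vectors
    T, S = X @ U, Y @ V                            # latent factors, each n x R
    # Inner model S = T D + H: one least-squares slope per factor (D diagonal).
    d = np.sum(T * S, axis=0) / np.sum(T * T, axis=0)
    # Y-loadings: regress Y on each column of S separately, giving Q (q x R).
    Q = (Y.T @ S) / np.sum(S * S, axis=0)
    beta = U @ np.diag(d) @ Q.T                    # p x q regression coefficients
    return {"U": U, "V": V, "T": T, "S": S, "d": d, "Q": Q, "beta": beta}

# Noise-free demo with a single shared latent factor (assumed data).
rng = np.random.default_rng(0)
t = rng.standard_normal(40)
X_demo = np.outer(t, rng.standard_normal(6))
Y_demo = np.outer(t, rng.standard_normal(4))
model = fit_two_block_pls(X_demo, Y_demo, R=1)
prediction = X_demo @ model["beta"]
```

In this noise-free case one latent factor reproduces Y exactly, which is a convenient sanity check
before adding noise.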

3 PRESS statistic 

3.1 The PRESS statistic for OLS regression 

In the least squares framework, leave-one-out cross validation is written as

$$E_{LOOCV} = \frac{1}{n} \sum_{i=1}^{n} \| y_i - x_i \beta_{OLS}(i) \|_2^2,$$

where $\beta_{OLS}$ are the OLS regression coefficients. The subscript i denotes the i th
observation whereas (i) denotes the estimate of $\beta$ using the observations
$(1, \ldots, i-1, i+1, \ldots, n)$. Given a solution $\beta(j)$, $\beta(i)$ with $j \neq i$ is
estimated by adding the j th observation and removing the i th observation. Estimating $\beta$
requires computing the inverse covariance matrix, $P = (X^T X)^{-1}$, which is computationally
expensive. However, since each $\beta(i)$ differs from $\beta$ by only one sample, we can easily
compute $P(i) = (X(i)^T X(i))^{-1}$ from P using the Sherman-Morrison-Woodbury formula without the
need to perform another matrix inversion [?]:

$$(X(i)^T X(i))^{-1} = (X^T X - x_i^T x_i)^{-1} = P + \frac{P x_i^T x_i P}{1 - h_i},$$

where $h_i = x_i P x_i^T$ and $x_i$ is the i th row of X. This allows the leave-one-out estimate,
$\beta(i)$, to be written as a function of $\beta$ in the following way, without the need to
explicitly remove any observations

$$\beta(i) = (X(i)^T X(i))^{-1} X(i)^T y(i) = \beta - \frac{P x_i^T (y_i - x_i \beta)}{1 - h_i}.$$

Finally, the i th LOOCV residual can simply be written as a function of the i th residual error,
$e_i = y_i - x_i \beta$, and does not require any explicit permutation of the data, as follows

$$e(i) = y_i - x_i \beta(i) = \frac{e_i}{1 - h_i}.$$

In the next section we derive a similar formulation for the PRESS for PLS.
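
The OLS result can be verified numerically by comparing the closed form $e_i / (1 - h_i)$ from a
single fit with n explicit refits. The sketch below assumes a univariate response and randomly
generated data; the function names are hypothetical.

```python
import numpy as np

def ols_press_residuals(X, y):
    """Leave-one-out residuals e_i / (1 - h_i) from a single OLS fit."""
    P = np.linalg.inv(X.T @ X)
    beta = P @ X.T @ y
    e = y - X @ beta                               # ordinary residuals e_i
    h = np.einsum("ij,jk,ik->i", X, P, X)          # h_i = x_i P x_i^T
    return e / (1.0 - h)

def ols_loocv_residuals(X, y):
    """Reference: refit n times, leaving out one observation each time."""
    n = X.shape[0]
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        out[i] = y[i] - X[i] @ beta_i
    return out

# Illustrative data (assumed): 25 observations, 4 covariates.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((25, 4))
y_demo = X_demo @ rng.standard_normal(4) + 0.5 * rng.standard_normal(25)

press_resid = ols_press_residuals(X_demo, y_demo)
loocv_resid = ols_loocv_residuals(X_demo, y_demo)
E_press = np.mean(press_resid ** 2)
E_loocv = np.mean(loocv_resid ** 2)
```

The two residual vectors agree to numerical precision, while the closed form needs one matrix
inversion instead of n model fits.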
3.2 A PRESS statistic for two-block PLS

As described in Section 2, estimating the PLS regression coefficients involves first estimating the
inner relationship between latent factors and then the outer relationship between S and Y. Both of
these steps are the result of lower dimensional least squares regression problems. Using the same
techniques as in Section 3.1 we can derive a similar expression for the PLS PRESS statistic in two
steps. However, in PLS the latent factors are estimated by projecting the data onto the subspaces
spanned by U and V respectively, which are found using the SVD of $M = U G V^T$.

Our developments rely on a mild assumption: provided the sample size n is sufficiently large, any
estimate of the latent factors obtained using n - 1 samples is close enough to the estimate obtained
using n data points. In terms of the PLS model, this assumption relates to the difference between
$g_r(M_n)$, the r th singular value of $M_n$ (which has been estimated using n rows of X and Y),
and $g_r(M_{n-1})$, estimated using n - 1 rows.

Formally, we assume that the following inequality

$$|g_r(M_n) - g_r(M_{n-1})| \le \epsilon \qquad (2)$$

holds for $1 \le r \le R$ where the approximation error, $\epsilon$, is arbitrarily small. Since
the rank r approximation error of the SVD is given by $g_{r+1}$, if the difference between the pairs
of singular values is small, it implies the difference between the corresponding pairs of singular
vectors is also small. In other words, within the LOOCV iterations, it is not necessary to recompute
the SVD of $X(i)^T Y(i)$. We show that this assumption holds in Section 3.3.

Assumption (2) implies that the i th PLS prediction error is $e(i) = y_i - x_i U D(i) Q(i)^T$.
Since the PLS inner model coefficient, D in Eq. (1), is estimated using least squares, we can derive
an efficient update formula for D(i) as a function of D using the Sherman-Morrison-Woodbury formula

$$D(i) = (T(i)^T T(i))^{-1} T(i)^T S(i) = D - \frac{P_t t_i^T (s_i - t_i D)}{1 - h_{t,i}},$$

where $P_t = (T^T T)^{-1}$ is the inverse covariance matrix of the X-latent factors and
$h_{t,i} = t_i P_t t_i^T$ is analogous to the diagonal of the OLS hat matrix. Similarly, the
Y-loading Q is also obtained using least squares and we can derive a similar update formula as
follows

$$Q(i)^T = Q^T - \frac{P_s s_i^T (y_i - s_i Q^T)}{1 - h_{s,i}},$$

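where $P_s = (S^T S)^{-1}$ and $h_{s,i} = s_i P_s s_i^T$. Under assumption (2), U and V are fixed
and so these recursive relationships allow us to formulate the i th PLS prediction error, e(i), in
terms of the i th PLS regression residual error $e_i$ in the following way for one latent factor,
$r = 1, \ldots, R$:

$$e_r(i) = y_i - x_i u_r D_r(i) Q_r(i)^T
         = e_i \left( \frac{1 - a}{1 - h_{s,i}} + \frac{a - b}{(1 - h_{t,i})(1 - h_{s,i})} \right),$$

where all quantities are computed for the r th latent factor and the following identities from the
PLS model have been used: $s_i Q^T = y_i - E_{y,i}$, $t_i = x_i U$, $s_i = x_i U D + H_i$ and
$x_i \beta = x_i U D Q^T$, and where $a = 1 - E_{y,i} / e_i$ and $b = y_i H_i P_s s_i / e_i$, with
divisions taken element-wise over the q responses.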
Since each of the PLS latent factors is estimated independently, for R > 1 the overall leave-one-out
error is simply the sum of the individual errors, $e_r(i)$, of the R latent factors

$$e(i) = \sum_{r=1}^{R} e_r(i) - y_i (R - 1).$$

Finally, the overall PRESS statistic is computed as

$$E_{PRESS} = \frac{1}{n} \sum_{i=1}^{n} \| e(i) \|_2^2, \qquad (3)$$

where $\|\cdot\|_2^2$ is the squared Euclidean norm.
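
The construction of this section can be sketched in NumPy as follows. The sketch is an
interpretation under stated assumptions, not the authors' code: U and V are held fixed as in
assumption (2), D and Q are estimated factor by factor, and the per-factor error is written in the
division-free form $E_{y,i} / (1 - h_{s,i}) + H_i (Q_r^T - P_s s_i y_i) / ((1 - h_{t,i})(1 - h_{s,i}))$,
which is what the per-factor leave-one-out update formulas for D and Q give once the factor $e_i$
is multiplied through. The function names and demo data are hypothetical. The second function
recomputes D and Q with each observation left out, again with U and V fixed, to act as a reference.

```python
import numpy as np

def pls_press_residuals(X, Y, R):
    """Leave-one-out PLS residuals e(i) from one fit, with U and V held fixed."""
    U_full, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    total = np.zeros(Y.shape)
    for r in range(R):
        t, s = X @ U_full[:, r], Y @ Vt[r]          # r-th latent factors
        d = (t @ s) / (t @ t)                       # inner-model coefficient
        q_r = (Y.T @ s) / (s @ s)                   # Y-loading, length q
        h_t, h_s = t**2 / (t @ t), s**2 / (s @ s)   # leverages h_{t,i}, h_{s,i}
        H = s - t * d                               # inner residuals H_i
        E_y = Y - np.outer(s, q_r)                  # outer residuals E_{y,i}
        second = H[:, None] * (q_r[None, :] - Y * (s / (s @ s))[:, None])
        total += (E_y / (1 - h_s)[:, None]
                  + second / ((1 - h_t) * (1 - h_s))[:, None])
    return total - (R - 1) * Y                      # e(i) = sum_r e_r(i) - (R-1) y_i

def pls_loo_fixed_uv(X, Y, R):
    """Reference: re-estimate D and Q without observation i (U, V fixed)."""
    U_full, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    n = X.shape[0]
    out = np.empty(Y.shape)
    for i in range(n):
        keep = np.arange(n) != i
        pred = np.zeros(Y.shape[1])
        for r in range(R):
            t, s = X @ U_full[:, r], Y @ Vt[r]
            d_i = (t[keep] @ s[keep]) / (t[keep] @ t[keep])
            q_i = (Y[keep].T @ s[keep]) / (s[keep] @ s[keep])
            pred += t[i] * d_i * q_i
        out[i] = Y[i] - pred
    return out

# Illustrative data (assumed): n = 40, p = 6, q = 4, two latent factors.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((40, 6))
Y_demo = X_demo @ rng.standard_normal((6, 4)) + rng.standard_normal((40, 4))
e_press = pls_press_residuals(X_demo, Y_demo, R=2)
e_ref = pls_loo_fixed_uv(X_demo, Y_demo, R=2)
E_press = np.mean(np.sum(e_press ** 2, axis=1))     # Eq. (3)
```

With U and V fixed the two computations agree to numerical precision; any difference from full
LOOCV therefore comes only from not recomputing the SVD.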



3.3 A bound on the approximation error 

Assumption (2) is key to developing an efficient version of the PRESS for PLS. It ensures that the
SVD need only be computed once with all the data instead of n times in a leave-one-out fashion. This
results in a computational saving of $O(n(p^2 q + q^2 p))$ [11], a cost which becomes prohibitively
expensive when both p and q are large. From a conceptual point of view, this assumption implicitly
states that underlying the data there are true latent factors which contribute most of the
covariance between X and Y. Therefore, removing a single observation should not greatly affect the
estimation of the latent factors. In this section we formalise this assumption by introducing a
theorem which places an upper bound on the error $\epsilon$.

In presenting the theorem, we first rely on two Lemmas. The first result details an upper bound on
the difference between the expected value of the covariance matrix of a random vector x and its
sample covariance matrix using n samples.

Lemma 1 (adapted from [16]) Let x be a random vector in $\mathbb{R}^p$. Assume for normalization
that $\|\mathbb{E}\, x^T x\|_2 \le 1$. Let $x_1, \ldots, x_n$ be independent realizations of x. Let

$$m_n = C \sqrt{\frac{\log n}{n}} A,$$

where C is a constant and $A = \|x\|_2$. If $m_n < 1$ then

$$\mathbb{E} \left\| \frac{1}{n} \sum_{i=1}^{n} x_i^T x_i - \mathbb{E}\, x^T x \right\|_2 \le m_n.$$



The second Lemma is a result from matrix perturbation theory which details an upper bound on the
maximum difference between the singular values of a matrix M and a perturbed matrix M + E.

Lemma 2 (adapted from [11]) For $M, E \in \mathbb{R}^{n \times p}$,

$$\max_i |g_i(M + E) - g_i(M)| \le \|E\|_2.$$
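
The perturbation bound of Lemma 2 is easy to check numerically. In the sketch below the matrix
sizes and the scale of the perturbation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 5))
E = 0.1 * rng.standard_normal((8, 5))              # illustrative perturbation

g_M = np.linalg.svd(M, compute_uv=False)           # singular values, descending
g_ME = np.linalg.svd(M + E, compute_uv=False)

max_shift = np.max(np.abs(g_ME - g_M))             # max_i |g_i(M + E) - g_i(M)|
spectral_norm_E = np.linalg.norm(E, 2)             # ||E||_2, largest singular value
```

Here `max_shift` never exceeds `spectral_norm_E`, whatever the random draw.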



We are now able to state our result. 

Theorem 1 Let $M_n$ be the sample covariance matrix of $X_n \in \mathbb{R}^{n \times p}$ and
$M_{n-1}$ be the sample covariance matrix of $X_{n-1} \in \mathbb{R}^{(n-1) \times p}$. If $g_r(M)$
is the r th ordered singular value of M then

$$\max_i |g_i(M_n) - g_i(M_{n-1})| \le m_{n-1}.$$

Proof: To prove this theorem, we first establish a bound on the error between the sample covariance
matrices $M_n$ and $M_{n-1}$ by applying Lemma 1 with n and n - 1 samples, where
$M = \mathbb{E}\, x^T x$, to obtain the following inequalities

$$\|M - M_n\|_2 \le m_n, \qquad (4)$$
$$\|M - M_{n-1}\|_2 \le m_{n-1}. \qquad (5)$$

Subtracting Eq. (4) from Eq. (5) and applying Minkowski's inequality we arrive at an expression for
the difference between the terms $M_n$ and $M_{n-1}$ as follows

$$\|M - M_{n-1}\|_2 - \|M_n - M_{n-1}\|_2 \le m_n - m_{n-1},$$
$$\|M_n - M_{n-1}\|_2 \le m_{n-1}. \qquad (6)$$

We now relate this result to the difference between computing the SVD of $M_n$ and the SVD of
$M_{n-1}$ by recognizing that $M_n$ is obtained as a result of perturbing $M_{n-1}$ by
$M_1 = x_1^T x_1$ where $x_1 \in \mathbb{R}^{1 \times p}$ is the single observation missing from
$M_{n-1}$. Using Lemma 2 and Eq. (6) we obtain

$$\max_i |g_i(M_{n-1} + M_1) - g_i(M_{n-1})| \le \|M_n - M_{n-1}\|_2 \le m_{n-1}, \qquad (7)$$

which proves the theorem. This theorem details an upper bound on the maximum difference between
pairs of ordered singular values of the covariance matrix estimated with all n observations and the
covariance matrix estimated with n - 1 observations. A and the constant C do not depend on n and so
are the same for $m_n$ and $m_{n-1}$. Therefore, the value of the error term defined by the bound
decreases as $O(\sqrt{\log(n)/n})$.



4 Model selection in Sparse PLS 

Although the dimensionality reduction inherent in PLS is often able to extract important features
from the data, in some situations where p and q are very large, there may be many irrelevant and
noisy variables which do not contribute to the predictive relationship between X and Y. Therefore,
for the purposes of obtaining a good prediction and for interpreting the resulting regression
coefficients, it is important to determine exactly which variables are the important ones and to
construct a regression model using only those variables. Sparse PLS regression has found many
applications in various areas, including genomics [17], computational finance [12] and machine
translation [18].

A Sparse PLS algorithm [12] can be obtained by rewriting the SVD, $M = U G V^T$, as a LASSO
penalized regression problem

$$\min_{u,v} \| M - u v^T \|_2^2 + \gamma \|u\|_1 \quad \text{s.t.} \quad \|v\|_2 = 1, \qquad (8)$$

where $u \in \mathbb{R}^{p \times 1}$ and $v \in \mathbb{R}^{q \times 1}$ are the estimates of the
first left and right singular vectors of M respectively. The unit norm constraint ensures that a
unique solution may be obtained. The amount of sparsity in the solution is controlled by $\gamma$.
If $\gamma$ is large enough, it will force some elements of u to be exactly zero. The problem of
Eq. (8) can be solved in an iterative fashion by first setting u and v to the first left and right
singular vectors of M, as before. The Lasso penalty can then be applied as a component-wise soft
thresholding operation on the elements of u (see, for instance, [19]). The sparse u are found by
applying the threshold component-wise as follows:

$$u^* = \mathrm{sgn}(Mv) \, (|Mv| - \gamma)_+,$$
$$v^* = M^T u^* / \|M^T u^*\|_2.$$
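
The alternating updates can be sketched as below. This is a minimal illustration, assuming the
matrix in the updates is $M = X^T Y$ (so that the second update uses $M^T$); the function names,
the stopping rule and the toy matrix are assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Component-wise soft thresholding: sgn(z) * (|z| - gamma)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def sparse_direction(M, gamma, max_iter=200, tol=1e-10):
    """Alternate the two updates, starting from the leading singular vectors of M."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0]
    for _ in range(max_iter):
        u_new = soft_threshold(M @ v, gamma)       # sparse update of u
        if not np.any(u_new):                      # gamma removed every variable
            return u_new, v
        w = M.T @ u_new
        v_new = w / np.linalg.norm(w)              # unit-norm update of v
        done = np.linalg.norm(v_new - v) < tol
        u, v = u_new, v_new
        if done:
            break
    return u, v

# Toy rank-one matrix in which only the first two of five X-variables matter.
a = np.array([3.0, 2.0, 0.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 1.0])
M_demo = np.outer(a, b)

u_sparse, v_sparse = sparse_direction(M_demo, gamma=0.5)
u_dense, v_dense = sparse_direction(M_demo, gamma=0.0)
```

With `gamma=0.0` the iteration returns a multiple of the first left singular vector; larger values
of `gamma` shrink the entries of u and set the smallest ones exactly to zero.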

Typically, selecting the optimal sparsity parameter, $\gamma$, involves evaluating the CV error over
a grid of points in $[\gamma_{min}, \gamma_{max}]$. However, since the behaviour of the CV error as
a function of $\gamma$ is not well understood, this often requires specifying a large number of
points, which exacerbates the already large computational burden of performing cross validation. In
such cases, model selection is often performed using problem-specific knowledge and intuition.
Problems of performing model selection in sparse PLS are discussed by [12]. Since the PLS PRESS is
computationally efficient for large n, p and q we can evaluate the Sparse PLS model over a large
grid of values of $\gamma$ and quickly compute the PRESS to determine the best cross-validated
degree of sparsity. Some experimental results are presented in the following Section.

5 Experiments 

In this section we report on the performance of the PRESS in experiments on simulated data where
the predictive relationship between the covariates and the response is dependent on latent factors
according to the PLS model in Eq. (1), where $E_x \in \mathbb{R}^{n \times p}$ and
$E_y \in \mathbb{R}^{n \times q}$ are matrices of i.i.d. random noise simulated from a normal
distribution, N(0, 1). For fixed values of R, n and p = q we first simulate R pairs of latent
factors t and s of length n from a bivariate normal distribution in descending order of covariance
with means drawn from a uniform distribution. Secondly, we simulate R separate pairs of loading
vectors u and v of length p and q respectively from a uniform distribution, U(0, 1). In order to
ensure the contribution of each latent factor is orthogonal, we orthogonalise the vectors
$U = [u_1, \ldots, u_R]$ with respect to each other using the QR decomposition, and similarly for
$V = [v_1, \ldots, v_R]$.
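
A data generator along these lines might look as follows. Several details are not specified in the
text and are assumptions here: the range of the factor covariances, the range of the uniform means,
unit variances for the latent factors, and the use of the orthogonalised u and v as the loadings of
X and Y. The function name `simulate_pls_data` is hypothetical.

```python
import numpy as np

def simulate_pls_data(n, p, q, R, seed=0):
    """Simulate X and Y driven by R pairs of correlated latent factors."""
    rng = np.random.default_rng(seed)
    # Covariance of each (t, s) pair, sorted in descending order (assumed range).
    rho = np.sort(rng.uniform(0.5, 0.95, size=R))[::-1]
    T, S = np.empty((n, R)), np.empty((n, R))
    for r in range(R):
        mean = rng.uniform(-1.0, 1.0, size=2)      # assumed range for the means
        cov = [[1.0, rho[r]], [rho[r], 1.0]]
        ts = rng.multivariate_normal(mean, cov, size=n)
        T[:, r], S[:, r] = ts[:, 0], ts[:, 1]
    # Loadings drawn from U(0, 1), then orthogonalised with the QR decomposition.
    U, _ = np.linalg.qr(rng.uniform(0.0, 1.0, size=(p, R)))
    V, _ = np.linalg.qr(rng.uniform(0.0, 1.0, size=(q, R)))
    X = T @ U.T + rng.standard_normal((n, p))      # N(0, 1) noise E_x
    Y = S @ V.T + rng.standard_normal((n, q))      # N(0, 1) noise E_y
    return X, Y, U, V

X_sim, Y_sim, U_sim, V_sim = simulate_pls_data(n=100, p=20, q=20, R=3)
```

The QR step makes the columns of U and V orthonormal, so each simulated latent factor contributes
along its own direction.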

To test the performance of the PRESS for selecting the number of latent factors, we perform a
Monte Carlo simulation where for each iteration we draw R as an integer from U(2, 8) so that the
true number of latent factors in the simulated data is constantly changing. We measure the
sensitivity of model selection using the PRESS and LOOCV by comparing the number of latent factors
which minimizes each of these quantities with the true value of R.

To test the performance of the PRESS for sparse PLS, we use the same simulation setting except
now we fix R = 1 and induce sparsity into the latent factor loadings for X, $u_r$. Now
$\|u_r\|_0 = p/j$, which implies that only p/j of the p elements in $u_r$ are non-zero and thus
contribute to the predictive relationship between X and Y. By altering j, we change how many of the
variables in X are useful for predicting Y.

We perform a Monte Carlo simulation whereby for each iteration we randomize the true number
of important variables in X by drawing j from U(1, 2) so that up to half of the variables in X may
be unimportant. We evaluate sparse PLS over a grid of 100 values of $\gamma$ which span the
parameter space. We measure the sensitivity of model selection using the PRESS as compared to LOOCV
by comparing the selected variables with the truly important variables.

Table 1 reports the ratio between the sensitivity achieved by the PRESS, $\pi_{PRESS}$, and the
sensitivity of LOOCV, $\pi_{LOOCV}$, for both of these settings averaged over 200 trials for
different values of n, p and q. When selecting R, PRESS and LOOCV achieve almost exactly the same
sensitivity for all values with only very small discrepancies when n < p. When selecting $\gamma$
the discrepancy between PRESS and LOOCV is noticeable when p and q are large compared with n. The
standard error of the sensitivity ratio (in parentheses) also increases when p and q are large.

                     $\pi_{PRESS} / \pi_{LOOCV}$
  n      p, q      Selecting R       Selecting $\gamma$
  50     100       1.0  (0.303)      0.92 (0.228)
  50     500       0.98 (0.390)      0.73 (0.344)
  50     1000      0.95 (0.351)      0.70 (0.333)
  100    100       1.0  (0.141)      0.99 (0.164)
  100    500       0.99 (0.412)      0.91 (0.190)
  100    1000      0.99 (0.389)      0.90 (0.203)
  200    100       1.0  (0.0)        1.0  (0.071)
  200    500       1.0  (0.095)      0.95 (0.055)
  200    1000      1.0  (0.332)      0.91 (0.158)
  300    100       1.0  (0.0)        1.0  (0.014)
  300    500       1.0  (0.1)        1.0  (0.055)
  300    1000      1.0  (0.127)      0.96 (0.11)
  500    100       1.0  (0.0)        1.0  (0.0)
  500    500       1.0  (0.095)      1.0  (0.0)
  500    1000      1.0  (0.12)       1.0  (0.017)

Table 1: Comparing the ratio of the sensitivity, $\pi$ (the proportion of times the correct model
is chosen), $\pi_{PRESS} / \pi_{LOOCV}$, when selecting R and $\gamma$ as a function of n, p and q.
The value in parentheses is the Monte Carlo standard error.

Figure 1 compares the computational time using PRESS and LOOCV for selecting $\gamma$ as a function
of n for different values of p and q. We report computational timings using a 2.0GHz Intel
Core2Duo with 4GB of RAM. Relative timing is measured using the tic and toc functions in Matlab
v7.8.0. It can be seen that the increase in computation time for LOOCV is quadratic as a function of
n and the number of variables. In comparison, the increase in computation time for PRESS is linear
in these quantities and very small relative to LOOCV. Because of this, it becomes computationally
prohibitive to perform LOOCV with a greater number of observations or variables than we have
presented.

Figure 2 reports the approximation error between LOOCV and PRESS for p = q = 100
as a function of n. As the number of samples increases, it can be seen that the error decreases as
$O(\sqrt{\log(n)/n})$.

In the simulations we have focussed on situations where the response is multivariate. However,
the case where the response is univariate (q = 1) also commonly occurs. In this situation, the
latent factor for Y collapses to a scalar and so we would expect the error between LOOCV and PRESS
to be smaller.

[Figure 1 plot: "Comparison of computation time"; x-axis: number of samples, n (100 to 1000);
curves: PRESS p=100, PRESS p=500, LOOCV p=100, LOOCV p=500]

Figure 1: Comparison of computational timing between PRESS and LOOCV. For fixed values of p
and q, computational time (in seconds) for PRESS is linear in the number of observations, n, whereas
for LOOCV the computational time increases as a function of $n^2$. PRESS is also linear in the
number of variables whereas LOOCV is quadratic in this quantity.

[Figure 2 plot: x-axis: number of samples, n (100 to 1000)]

Figure 2: The approximation error between LOOCV and PRESS. For fixed p = q = 100, the
approximation error is small. As the number of samples, n, increases, the error between LOOCV and
PRESS decreases as $O(\sqrt{\log(n)/n})$. See Table 1 for further results.

6 Conclusions

We have shown that in order to obtain good prediction in high dimensional settings using PLS, a
computationally efficient and accurate model selection method is needed to tune the many crucial
parameters. In this work we have derived an analytic form for the PRESS statistic for two-block PLS
regression and we have proved that, under mild assumptions, the PRESS is equivalent to LOOCV up to
an approximation error of order $O(\sqrt{\log(n)/n})$.

We have also presented an example application where the PRESS can be used to tune additional
parameters which improve prediction and interpretability of results in the case of sparse PLS. We
have shown through simulations that using the PRESS to tune the number of latent factors, R, and the
sparsity parameter, $\gamma$, performs almost identically to LOOCV, the method most commonly used in
the literature, at a far lower computational cost. When the number of samples is large, LOOCV
and PRESS perform identically.

Although the analytic PRESS statistic for two-block PLS regression can be easily applied to the
many settings where parameters must be tuned accurately, there are still opportunities for further
work. One such possibility is to construct an objective function in terms of $\gamma$, the
regularization parameter, so that the optimal degree of sparsity can be tuned automatically.